Skip to content

[ENH]: Plumb prefix for spann and hnsw segment #4753

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: 06-03-_enh_plumb_prefix_path_all_the_way_to_the_bf_writer
Choose a base branch
from

Conversation

sanketkedia
Copy link
Contributor

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

@sanketkedia sanketkedia marked this pull request as ready for review June 4, 2025 18:06
Copy link

github-actions bot commented Jun 4, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

Copy link
Contributor Author

sanketkedia commented Jun 4, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

Copy link
Contributor

propel-code-bot bot commented Jun 4, 2025

Plumb Prefix Path Support Through Segment, HNSW, and Blockstore Codebase

This PR introduces comprehensive propagation of path prefixes into segment and HNSW/hnswlib-related data storage and retrieval. Paths are now constructed and threaded through all I/O layers for segment-based, blockfile, and hnsw indices. This affects construction, read, write, fork, flush, prefetch, and blockfile location-ensuring distinct collection isolation, cross-cloud compatibility, and addressing future requirements for prefix-based multi-tenancy in storage.

Key Changes:
• Introduced and threaded prefix_path through all relevant methods and types in segment, HNSW index, blockfile provider, blockfile writer/reader, migration, and test helpers.
• Changed API and usage of BlockfileReaderOptions and BlockfileWriterOptions to require and correctly inject a prefix_path, propagating this through ArrowBlockfileProvider, HashMapBlockfileProvider, HnswIndexProvider, and RootManager.
• Adjusted all places (I/O, migration, fork, flush, cache, tests, GC) where file/block/sparse index keys or paths are constructed, to properly prepend or parse prefix_paths.
• Updated error handling and validation to check prefix consistency across a segment's blockfile paths (with explicit error cases for prefix mismatches).
• All tests, helpers, and garbage collector logic adapted to use the new path prefix semantics.

Affected Areas:
• blockfile provider and ArrowBlockfileProvider
• segment type methods and prefix construction
• distributed_spann, distributed_hnsw, and blockfile_hnsw segments
• HnswIndexProvider and index read/write/fork/flush
• test_helpers, test suites, and property-based random segment/gen helpers
• garbage collector operators (file listing, unused files, delete operations)
• root_manager, block_manager, and migration code
• all S3/block path manipulations (including in proptests & orchestrator)

Potential Impact:

Functionality: All code paths that read, write, fork, or flush blocks/index/data now require and use a correct segment and prefix_path context; incorrect or missing prefixes will produce errors or lead to storage isolation violations.

Performance: Prefetch, cache, and block location performance may see minor penalties due to added prefix computations, though most changes are in I/O glue and not hot path.

Security: Stronger isolation between tenants/collections due to forced prefix scoping in storage access patterns.

Scalability: Supports future multi-cloud or multi-tenant scale-out via path partitioning; failure to provide/validate prefixes may break existing deployments not migrated to prefix-aware storage.

Review Focus:
• Correctness of prefix threading in all blockfile read/write, hnsw, distributed segment, and GC file usage.
• Correct error handling for prefix mismatches in segment file_path.
• Correct serialization/deserialization and root_manager handling for prefixed data.
• Edge cases in test helpers/proptests around random prefix and block id construction.
• Compatibility with older data/segments (where prefix is empty string) and safe migration.

Testing Needed

• Full test pass for rust/segment, rust/index, and rust/blockstore (including proptests).
• Integration test runs for segment creation/fork/flush, read/write, GC, and multi-collection setups.
• Manual migration/testing of real storage environments with legacy and new-style (prefixed) data.
• Validation of error cases when prefixes are missing, mismatched, or corrupted.
• Backwards compatibility checks with legacy paths (empty prefix_path) for bootstrapped data.

Code Quality Assessment

rust/segment/src/blockfile_record.rs: Mostly safe refactoring and threading. Large function parameter lists now take prefix_path from segment. Error enums and match arms are updated properly.

rust/index/src/spann/types.rs: Careful propagation of prefix_path, additional arguments in multi-argument methods, explicit validation of consistency.

rust/blockstore/src/arrow/blockfile.rs: All block get/fork/load/flush paths updated for new path, tests adapted. Clean use of new BlockfileReaderOptions for prefix clarity.

rust/garbage_collector/src/operators/compute_unused_files.rs: Careful path parsing and key generation per new logic. Adequately tested for edge cases.

error-handling: More explicit error types (~InvalidPrefixPath) and guard rails to prevent inconsistent state; generally improved error reporting.

Best Practices

Modularization:
• Decouples path construction/validation logic into segment and provider.
• Introduces explicit error typing for cross-segment invariant enforcement.

Type Safety:
• Correct use of newtyped BlockfileReaderOptions for call site clarity.
• Prefixed block and root key serialization improves API contracts.

Test Coverage:
• Proptests, edge-case integration tests, and GC tests all fully upgraded and maintained.

Potential Issues

• If upstream code fails to call new functions with the right prefix_path, data corruption or cross-collection leaks are possible.
• Some error cases (legacy paths, empty prefix, or upgrade scenarios) might be missed in migration, requiring careful operational attention.
• Tests relying on path matching may break if hardcoded non-prefixed paths are present.
• Added parameter requirements will break downstream or callsite code if not updated.

This summary was automatically generated by @propel-code-bot

@sanketkedia sanketkedia force-pushed the 06-03-_enh_plumb_prefix_path_all_the_way_to_the_bf_writer branch from 573315d to c24e5b0 Compare June 10, 2025 15:42
@sanketkedia sanketkedia force-pushed the 06-04-_enh_plumb_prefix_for_spann_and_hnsw_segment branch from 7c0cc19 to 46411af Compare June 10, 2025 15:42
@sanketkedia sanketkedia force-pushed the 06-03-_enh_plumb_prefix_path_all_the_way_to_the_bf_writer branch from c24e5b0 to a220d89 Compare June 10, 2025 15:48
@sanketkedia sanketkedia force-pushed the 06-04-_enh_plumb_prefix_for_spann_and_hnsw_segment branch from 46411af to 6c82dd4 Compare June 10, 2025 15:48
@sanketkedia sanketkedia requested a review from HammadB June 10, 2025 15:53
@sanketkedia sanketkedia force-pushed the 06-04-_enh_plumb_prefix_for_spann_and_hnsw_segment branch from 6c82dd4 to d29be4b Compare June 10, 2025 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant